A Deep Dive into Course Descriptions: Using Quanteda to Identify Work-Based Learning Opportunities
Data@Urban Draft
Authors
Manuel Alcalá and Judah Axelrod
Published
June 8, 2023
This blog post is part two of a series on analyzing work-based learning opportunities in community colleges. In part one, we discussed how we used web scraping to gather course descriptions from community colleges in Florida. Now, we’ll delve into how we analyzed this data using the quanteda package in R.
1 Introduction
In our previous post, we detailed our journey of collecting course descriptions from Florida’s community colleges using web scraping techniques. We successfully compiled a comprehensive dataset, but the question remained: how do we make sense of this vast amount of text data? Enter quanteda, an R package designed for quantitative text analysis.
2 Getting Started with Quanteda
Quanteda, short for Quantitative Analysis of Textual Data, is a powerful tool for managing and analyzing text data in R. It provides a suite of functions for corpus management, creating document-feature matrices, analyzing keywords, and more. These functions are highly performant and have a consistent interface with support for multiple languages. While quanteda is a standalone package, it has many extensions, such as readtext, spacyr, and quanteda.textstats.
We can load and install the required packages using the librarian package for convenience as follows:
Code
librarian::shelf(
  tidyverse, # data wrangling
  quanteda,  # text mining
  readtext   # read texts and associated document-level metadata
)
We can then load our text data containing course descriptions, along with document-level metadata, using the readtext::readtext() function. We then performed additional cleaning steps to standardize variable names and restrict the analysis to active, for-credit courses.
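A minimal sketch of this loading-and-cleaning step is below. The file path and the column names (course_id, description, Status, For_Credit) are placeholders for the real scraped output, so we write a small temporary CSV to make the example self-contained:

```r
library(readtext)
library(dplyr)

# Hypothetical CSV of scraped courses (path and column names are
# placeholders for the real scraped output)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(course_id   = "BC-JOU-2200",
                     description = "Course provides practical internship experience",
                     Status      = "Active",
                     For_Credit  = TRUE),
          tmp, row.names = FALSE)

# text_field names the column holding the description; docid_field the
# unique course identifier
courses <- readtext(tmp, text_field = "description",
                    docid_field = "course_id") %>%
  rename_with(tolower) %>%               # standardize variable names
  filter(status == "Active", for_credit) # keep active, for-credit courses
```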
The first step in our analysis was to create a corpus, a collection of text documents, from our course descriptions. We can do this using the corpus() function in quanteda. Next, we extract the tokens in the corpus—these are usually words but can also be n-grams or multi-word expressions. The tokens() function enables us to define what counts as a token and to apply rules that ignore elements such as punctuation and digits.
Code
corpus <- corpus(courses)
Corpus consisting of 5 documents and 29 docvars.
BC-THE-2300 :
"A STUDY OF DRAMATIC LITERATURE FROM THE TIME OF THE EARLY GR..."
BC-JST-1500 :
"A SURVEY OF JEWISH CULTURE (JST1500) IS AN EXAMINATION OF JE..."
BC-LEI-1700 :
"AN OVERVIEW OF THE CHARACTERISTICS AND NEEDS OF MEMBERS OF S..."
BC-JOU-2200 :
"COURSE PROVIDES INSTRUCTION AND PRACTICAL EXPERIENCE IN COPY..."
BC-FRE-1121 :
"CONTINUATION OF FRE 1120. FURTHER DEVELOPMENT OF THE BASIC S..."
Code
tk <- tokens(corpus, what = "word", remove_punct = TRUE, remove_numbers = TRUE)
Tokens consisting of 5 documents and 29 docvars.
BC-THE-2300 :
[1] "A" "STUDY" "OF" "DRAMATIC" "LITERATURE"
[6] "FROM" "THE" "TIME" "OF" "THE"
[11] "EARLY" "GREEKS"
[ ... and 56 more ]
BC-JST-1500 :
[1] "A" "SURVEY" "OF" "JEWISH" "CULTURE"
[6] "JST1500" "IS" "AN" "EXAMINATION" "OF"
[11] "JEWISH" "THOUGHT"
[ ... and 18 more ]
BC-LEI-1700 :
[1] "AN" "OVERVIEW" "OF" "THE"
[5] "CHARACTERISTICS" "AND" "NEEDS" "OF"
[9] "MEMBERS" "OF" "SPECIAL" "GROUPS"
[ ... and 12 more ]
BC-JOU-2200 :
[1] "COURSE" "PROVIDES" "INSTRUCTION" "AND" "PRACTICAL"
[6] "EXPERIENCE" "IN" "COPY" "EDITING" "REWRITING"
[11] "HEADLINE" "WRITING"
[ ... and 19 more ]
BC-FRE-1121 :
[1] "CONTINUATION" "OF" "FRE" "FURTHER" "DEVELOPMENT"
[6] "OF" "THE" "BASIC" "SKILLS" "IN"
[11] "SPEAKING" "LISTENING"
[ ... and 51 more ]
3 Key-Term Searches with a Dictionary
Our primary interest was to identify courses related to different types of work-based learning (WBL), such as internships, apprenticeships, or practicums. For each type of work-based learning experience, we created a list of terms that we want to treat equivalently. For instance, our dictionary can specify that a course description refers to a clinical WBL if either of the terms “clinicals” or “clinical experience” appears.
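A dictionary of this kind can be built with quanteda's dictionary() function. The term lists below are a small illustrative subset (the actual lists used in the analysis were more extensive); the glob pattern "internship*" matches "internship", "internships", and so on:

```r
library(quanteda)

# Illustrative dictionary mapping each WBL type to the terms we treat
# as equivalent; the real term lists were more extensive
wbl_dict <- dictionary(list(
  internship     = c("internship*", "intern experience"),
  apprenticeship = c("apprentice*"),
  practicum      = c("practicum*"),
  clinical       = c("clinicals", "clinical experience")
))
```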
Now we’re ready to search for the terms in our dictionary using the kwic() (keyword-in-context) function. It takes a tokens object and a dictionary as inputs, along with a window parameter specifying the number of tokens before and after a keyword that we want to see for context.
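The search step can be sketched as follows. The example builds its own toy tokens object and one-entry dictionary so it runs on its own; in the actual analysis, the tokens and dictionary created earlier would be passed in:

```r
library(quanteda)

# Self-contained toy inputs: one course description and a one-entry
# dictionary standing in for the full WBL dictionary
tk <- tokens(corpus(c(doc1 = "Course provides a supervised internship experience")),
             remove_punct = TRUE)
wbl_dict <- dictionary(list(internship = "internship*"))

# kwic() expects a tokens object; window = 5 shows five tokens of
# context on each side of every match
wbl_matches <- kwic(tk, pattern = wbl_dict, window = 5)
wbl_matches
```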
Lastly, we can join the results from the dictionary-based keyword-in-context search back to the course-level data and perform some wrangling to analyze the prevalence of work-based learning opportunities in Florida’s community colleges.
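One way to sketch this join, using toy stand-ins for the real objects (kwic_df mimics a data frame of kwic() matches, and courses the course-level data; the column names are illustrative):

```r
library(dplyr)

# Toy stand-ins for the real objects
kwic_df <- data.frame(docname = c("BC-JOU-2200", "BC-JOU-2200"),
                      keyword = c("internship", "internships"))
courses <- data.frame(doc_id = c("BC-THE-2300", "BC-JOU-2200"))

# Flag each course that contains at least one WBL term
wbl_flags <- kwic_df %>%
  distinct(docname) %>%
  mutate(has_wbl = TRUE)

# Join the flags back to the course-level data; courses with no match
# get has_wbl = FALSE
courses_wbl <- courses %>%
  left_join(wbl_flags, by = c("doc_id" = "docname")) %>%
  mutate(has_wbl = coalesce(has_wbl, FALSE))
```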
In this post, we’ve demonstrated how to use quanteda to analyze course descriptions and identify work-based learning opportunities in community colleges. While our analysis focused on Florida, the same methods could be applied to other states or regions.
Through this analysis, we’ve gained valuable insights into the prevalence of work-based learning in Florida’s community colleges. We hope that our work can serve as a foundation for further research and policy discussions on this important topic.
In the end, the power of quanteda lies not just in its ability to handle large text data, but also in its flexibility. It allows researchers to tailor their analysis to their specific needs, whether that’s identifying key terms, comparing text documents, or exploring text patterns.
This blog post was co-authored by Judah Axelrod and Manuel Alcala Kovalski.